Single-Cell Data Analysis - hands-on session
@CSoneson@FedeBioinfoParts of the material in this tutorial were developed in collaboration with Panagiotis Papasaikas, FMI Computational Biology. Some other parts are taken from, or inspired by, the ‘Orchestrating Single-Cell Analysis with Bioconductor’ book, available from https://bioconductor.org/books/release/OSCA/.
For this tutorial, we will use an example data set from the TENxPBMCData Bioconductor package. This package provides access to a number of PBMC data sets generated with the 10x Genomics Chromium platform. Here we will use one containing approximately 3,000 cells. The SingleCellExperiment object included in the package contains gene annotations, including the Ensembl gene ID and the gene symbol; to make it easier to interpret the results we use the gene symbol wherever possible.
library(TENxPBMCData)
library(SingleCellExperiment)
sce <- TENxPBMCData("pbmc3k")
rownames(sce) <- scater::uniquifyFeatureNames(
ID = rowData(sce)$ENSEMBL_ID,
names = rowData(sce)$Symbol_TENx
)
sce
# class: SingleCellExperiment
# dim: 32738 2700
# metadata(0):
# assays(1): counts
# rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
# rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
# colnames: NULL
# colData names(11): Sample Barcode ... Individual Date_published
# reducedDimNames(0):
# mainExpName: NULL
# altExpNames(0):
The loaded object has no column names, so we set the column names to be the cell barcodes. First, we check that they are all unique. What would you do if they were not (and why might that happen)?
all(gsub("-1", "", sce$Barcode) == sce$Sequence)
# [1] TRUE
any(duplicated(sce$Sequence))
# [1] FALSE
colnames(sce) <- sce$Sequence
We can see that sce is indeed a SingleCellExperiment object, and that the
structure is similar to that of a SummarizedExperiment object.
sce
# class: SingleCellExperiment
# dim: 32738 2700
# metadata(0):
# assays(1): counts
# rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
# rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
# colnames(2700): AAACATACAACCAC AAACATTGAGCTAC ... TTTGCATGAGAGGC
# TTTGCATGCCTCAC
# colData names(11): Sample Barcode ... Individual Date_published
# reducedDimNames(0):
# mainExpName: NULL
# altExpNames(0):
Accessing the assay data, row and column annotations is done in the same way as
for SummarizedExperiment objects. There is also an additional counts()
accessor for the counts assay.
While the structure of the scRNA-seq data is similar to that of the bulk data, there are also important differences that affect the downstream analysis. One of these differences is that single-cell data is much more sparse; in other words, there are many more zeros in the count matrix from a single-cell experiment than from a bulk experiment. This is due to things such as:
Let’s check the fraction of zeros in our count matrix:
mean(counts(sce) == 0)
# [1] 0.9741281
We also calculate the range of library sizes, noting that these are much smaller than the typical values for bulk RNA-seq:
summary(colSums(counts(sce)))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 548 1758 2197 2367 2763 15844
The number of cells in a scRNA-seq data set is typically much (several orders of magnitude) larger than the number of samples in a bulk RNA-seq experiment. Hence, the count matrices can get very large. However, since most of the values are zero, efficient storage modes, where only the non-zero values and the corresponding matrix positions are stored, can be employed. We can make sure that the count matrix in our object is indeed such a sparse matrix (in this particular data set, it is actually provided as a DelayedMatrix, which is beyond the scope of this course to discuss in detail, but which can be suitable for very large data matrices that do not fit in memory).
class(counts(sce))
# [1] "DelayedMatrix"
# attr(,"package")
# [1] "DelayedArray"
counts(sce) <- as(counts(sce), "dgCMatrix")
class(counts(sce))
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
counts(sce)[1:10, 1:10]
# 10 x 10 sparse Matrix of class "dgCMatrix"
#
# MIR1302-10 . . . . . . . . . .
# FAM138A . . . . . . . . . .
# OR4F5 . . . . . . . . . .
# RP11-34P13.7 . . . . . . . . . .
# RP11-34P13.8 . . . . . . . . . .
# AL627309.1 . . . . . . . . . .
# RP11-34P13.14 . . . . . . . . . .
# RP11-34P13.9 . . . . . . . . . .
# AP006222.2 . . . . . . . . . .
# RP4-669L17.10 . . . . . . . . . .
We already noted a couple of differences between scRNA-seq data and bulk data: the former typically contains many more observations, and the count matrix is much more sparse. The low amount of starting material for scRNA-seq experiments also results in a high sampling noise, and a lower correlation among cells than among bulk RNA-seq samples. This can be seen in a scatter plot of the observed counts for two randomly selected cells in our example data set:
## Scatter plot of two cells with similar library sizes
idx <- order(abs(colSums(counts(sce)) -
median(colSums(counts(sce)))))[1:2]
colSums(counts(sce))[idx]
# ACGGTCCTAACGGG CGGGCATGTTGTGG
# 2197 2197
plot(counts(sce)[, idx[1]] + 1, counts(sce)[, idx[2]] + 1, log = "xy")
Note that there are many genes that are not observed in one of the cells, but still have a high number of assigned counts in the other cell.
How many genes are detected in at least one cell? Compare that to the number of genes detected in each of the individual cells. What does this tell you?
## Total number of detected genes
sum(rowSums(counts(sce)) > 0)
# [1] 16634
## Genes detected in single cells
summary(colSums(counts(sce) > 0))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 212.0 690.0 817.0 847.0 953.2 3422.0
This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.
While we don’t observe each gene in each cell, there is still a clear association between the overall expression level of a gene (e.g., total count) and the fraction of cells where it is detected.
## Dropout plot
plot(rowSums(counts(sce)), rowMeans(counts(sce) == 0), log = "x")
To simplify further QC, we use the Bioconductor package scater to calculate a number of summary QC statistics, both for the cells and for the genes. By providing a subset of genes (here, mitochondrial genes), we can calculate the fraction of counts falling in these genes.
library(scater)
mt <- rownames(sce)[grep("^MT-", rowData(sce)$Symbol_TENx)]
mt
# [1] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" "MT-ATP8" "MT-ATP6" "MT-CO3"
# [8] "MT-ND3" "MT-ND4L" "MT-ND4" "MT-ND5" "MT-ND6" "MT-CYB"
sce <- addPerFeatureQC(sce)
sce <- addPerCellQC(sce, subsets = list(MT = mt))
This adds a number of columns to the rowData and colData slots of the
SingleCellExperiment object:
colnames(rowData(sce))
# [1] "ENSEMBL_ID" "Symbol_TENx" "Symbol" "mean" "detected"
colnames(colData(sce))
# [1] "Sample" "Barcode" "Sequence"
# [4] "Library" "Cell_ranger_version" "Tissue_status"
# [7] "Barcode_type" "Chemistry" "Sequence_platform"
# [10] "Individual" "Date_published" "sum"
# [13] "detected" "subsets_MT_sum" "subsets_MT_detected"
# [16] "subsets_MT_percent" "total"
For example, we can plot the distribution of library sizes (sum) and the
number of detected genes (detected) across the cells
hist(log10(sce$sum), breaks = 30)
hist(sce$detected, breaks = 30)
Finally, we can look at the set of genes accounting for the majority of the counts in the data set.
scater::plotHighestExprs(sce, n = 15)
The proportion of counts assigned to mitochondrial genes is another useful indicator of cell quality, since high numbers of such reads can be associated to cell damage.
hist(sce$subsets_MT_percent, breaks = 30)
Now that we have calculated the QC metrics, we will use them to filter out low-quality cells that will be excluded from the rest of the analysis. The optimal parameters for filtering are debated and likely data set dependent, but a typical approach is to remove cells that fall ‘too far’ from the average cells on one or more of the considered criteria. This makes the implicit assumption that ‘most’ cells are of good quality, which is often sensible. It should also be noted that in some cases, cells that seem to be of bad quality can do so for biological reasons. For example, certain cell types express very few genes, or have a high metabolic rate and consequently express a lot of mitochondrial genes.
Here, we will exclude cells according to two criteria:
For each of these criteria, we exclude cells that are more than 4 median absolute deviations (MAD) from the median across cells, in the direction indicating low quality.
low_detected <- isOutlier(sce$detected, type = "lower",
log = TRUE, nmads = 4)
high_mt <- isOutlier(sce$subsets_MT_percent, type = "higher",
log = FALSE, nmads = 4)
plot(rank(-sce$detected), sce$detected, col = low_detected + 1)
plot(rank(sce$subsets_MT_percent), sce$subsets_MT_percent,
col = high_mt + 1)
We filter out the cells identified as being of low quality according to the thresholds defined above.
sce$retain <- !low_detected & !high_mt
sce <- sce[, sce$retain]
dim(sce)
# [1] 32738 2637
Just as for bulk RNA-seq, the raw scRNA-seq counts are not directly comparable
across cells due to, e.g., differences in library sizes. Thus, we need to apply
a normalization strategy. There are many approaches to normalization of
scRNA-seq data (see e.g. Lytal, Ran, and An (2020) and Cole et al. (2019)
for comparisons); here, we will use one implemented in the scran
package. Similarly to the TMM and DESeq normalization approaches that we have
discussed previously, this works by estimating a size factor for each cell,
which incorporates the library size as well as a measure of the RNA composition.
The bulk RNA-seq methods are sometimes struggling with scRNA-seq due to the
large number of zeros; the scran approach solves this by repeatedly pooling
multiple cells (which reduces the number of zeros), calculating size factors
for the pools, and deriving individual size factors for the cells from those of
the pools. After calculating the size factors, we normalize the observed counts
and log-transform the values. The new “log counts” are placed in a new assay
in sce, named logcounts.
library(scran)
sce <- computeSumFactors(sce, min.mean = 0.1)
sce <- logNormCounts(sce)
assayNames(sce)
# [1] "counts" "logcounts"
We plot the estimated size factors against the total count for each cell. Is there an association? Is this what you expected?
plot(sce$sum, sizeFactors(sce))
Variation in gene abundance estimates between different cells can be thought of
as the convolution of the technical (mainly sampling) and the biological (e.g
cell type) sources of variance. Typically one wants to isolate and focus on the
biological variance so that differences due to experimental noise have as small
an impact as possible on subsequent analyses.
There are different approaches to disentangling the technical and biological
variability in a single-cell data set. Some of these assume that “most” genes
exhibit only technical variability (i.e., most genes are not differentially
expressed between different studied cell types). Other approaches assume that
the technical variance follows a Poisson distribution (a common distribution
capturing sampling variability in count data), and that deviations from the
Poisson expectation corresponds to biological variability.
## Fit a trend
dec.trend <- modelGeneVar(sce)
fit.trend <- metadata(dec.trend)
plot(fit.trend$mean, fit.trend$var, xlab = "Mean of log-expression",
ylab = "Variance of log-expression")
curve(fit.trend$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)
## Poisson assumption
dec.pois <- modelGeneVarByPoisson(sce)
plot(dec.pois$mean, dec.pois$total, xlab = "Mean of log-expression",
ylab = "Variance of log-expression")
curve(metadata(dec.pois)$trend(x), col = "dodgerblue", add = TRUE)
In each case, the biological variance of a gene can be estimated as the
difference between the total variance and the modeled technical variance.
This value can be used to rank the genes, and to select a set of “highly
variable” genes to use as the basis of further downstream analysis. Here, we
select the top 1,000 genes based on the trend fit, and add these to the
metadata of the SingleCellExperiment object.
head(dec.trend[order(dec.trend$bio, decreasing = TRUE), ])
# DataFrame with 6 rows and 6 columns
# mean total tech bio p.value FDR
# <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# LYZ 1.63952 3.90841 0.883653 3.02476 4.26274e-102 2.35971e-98
# S100A9 1.07085 3.36583 0.778017 2.58781 1.58787e-96 6.59245e-93
# HLA-DRA 1.57243 3.17372 0.879317 2.29440 3.08079e-60 5.11627e-57
# FTL 3.67644 2.86164 0.693417 2.16822 1.45835e-85 4.84376e-82
# CD74 2.24359 2.75907 0.851566 1.90750 6.07457e-45 5.60447e-42
# CST3 1.10758 2.58206 0.790017 1.79204 4.90571e-46 4.79230e-43
head(dec.pois[order(dec.pois$bio, decreasing = TRUE), ])
# DataFrame with 6 rows and 6 columns
# mean total tech bio p.value FDR
# <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# LYZ 1.63952 3.90841 0.656248 3.25216 0 0
# S100A9 1.07085 3.36583 0.654024 2.71180 0 0
# FTL 3.67644 2.86164 0.213746 2.64790 0 0
# HLA-DRA 1.57243 3.17372 0.665876 2.50784 0 0
# CD74 2.24359 2.75907 0.525008 2.23406 0 0
# FTH1 3.51660 2.19825 0.239881 1.95837 0 0
hvg.var <- getTopHVGs(dec.trend, n = 1000)
head(hvg.var)
# [1] "LYZ" "S100A9" "HLA-DRA" "FTL" "CD74" "CST3"
metadata(sce)$hvgs <- hvg.var
For bulk RNA-seq data, we typically use PCA for visualization and exploratory analysis. This can of course also be applied to single-cell RNA-seq data. However, other methods (e.g., tSNE and UMAP) are more commonly used for scRNA-seq. Both tSNE and UMAP are non-linear dimension reduction methods, which focus to a larger extent on retaining small distances. That means that cells that are similar to each other (e.g., cells of the same cell type) tend to be placed close together in the low-dimensional representation, whereas larger distances are potentially less faithfully represented. PCA, on the other hand, tends to put more focus on preserving the large cell-to-cell distances. Importantly, even though PCA is rarely used for visualization of large scRNA-seq data sets in two-dimensional plots, it is often used as a first dimension reduction step, and other approaches such as tSNE and UMAP are subsequently applied to the PCA output.
Here, we therefore first apply PCA to our data set. We supply the set of
highly variable genes derived above, to only calculate the principal components
based on these. By default, the runPCA function from the
scater package will apply the PCA to the ‘logcounts’ assay.
We plot the first two principal components (note that we extract 50 components,
which will be used as the basis for the tSNE later) and color by the number of
detected genes.
sce <- runPCA(sce, exprs_values = "logcounts", ncomponents = 50,
subset_row = metadata(sce)$hvgs)
sce
# class: SingleCellExperiment
# dim: 32738 2637
# metadata(1): hvgs
# assays(2): counts logcounts
# rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
# rowData names(5): ENSEMBL_ID Symbol_TENx Symbol mean detected
# colnames(2637): AAACATACAACCAC AAACATTGAGCTAC ... TTTGCATGAGAGGC
# TTTGCATGCCTCAC
# colData names(19): Sample Barcode ... retain sizeFactor
# reducedDimNames(1): PCA
# mainExpName: NULL
# altExpNames(0):
reducedDimNames(sce)
# [1] "PCA"
plotReducedDim(sce, "PCA", colour_by = "detected")
Next, we apply tSNE, which is a non-linear, stochastic dimension reduction technique that attempts to find a mapping of the data on a low subspace while preserving local distances between cells. The non-linear character of tSNE means that it will often produce projections that better resolve differences between cell groups. The better separation of tSNE comes at the cost of interpretability; while in a tSNE representation similar cells are placed close to each other, longer distances in the representation are not guaranteed to reflect true relationships. This means that it is risky to draw conclusions of “similarity” or “dissimilarity” from the positional relationships of different cell groupings that appear in a tSNE plot. In addition, the stochastic nature of tSNE means that every time the algorithm is applied a different representation will be produced unless a random seed is set.
set.seed(123)
sce <- runTSNE(sce, dimred = "PCA")
reducedDimNames(sce)
# [1] "PCA" "TSNE"
plotReducedDim(sce, "TSNE", colour_by = "detected")
We can often identify specific cell types and gain some understanding of the data directly from the visualization. Typically, this would be done by colouring the cells according to the expression level of known marker genes for certain cell types.
## Color by expression of a B-cell marker (CD20)
plotReducedDim(sce, "TSNE", colour_by = "MS4A1")
## Color by expression of a T-cell marker
plotReducedDim(sce, "TSNE", colour_by = "CD3D")
## Color by expression of a monocyte marker
plotReducedDim(sce, "TSNE", colour_by = "LYZ")
## Color by expression of a platelet marker
plotReducedDim(sce, "TSNE", colour_by = "PPBP")
As we have seen previously, exploratory analysis is very important for high-throughput data analysis. Performing this in an interactive way, rather than via static QC plots, can often be more efficient. Within the Bioconductor framework, one way to do this is via the iSEE package, which directly takes a SummarizedExperiment (or a derivative such as a SingleCellExperiment) object as input.
library(iSEE)
iSEE(sce)
There is a plethora of methods available for clustering scRNA-seq data (see e.g. Duò, Robinson, and Soneson (2018) for a comparison). So called graph-based clustering methods are commonly used and often provide good results, and we will illustrate here how they can be applied.
First, we will use scran to generate the shared nearest neighbor
graph, which will then be subjected to community detection using algorithms
implemented in the igraph package. The SNN graph is constructed
using the buildSNNGraph function in scran, given the input
space to use (here, we use the PCA representation calculated above) and the
number of neighbors to use in the original KNN graph generation. We also specify
the type of weighting to use when generating the SNN graph. The default is type = "rank", which sets the weight between two nodes to k - r/2, where r is
the smallest sum of ranks for any shared neighbors (Xu and Su 2015). Alternatively,
type = "number" sets the weight to the number of shared neighbors.
graph_k10 <- scran::buildSNNGraph(sce, k = 10, use.dimred = "PCA", type = "rank")
Once the SNN graph is generated, we can use any of the community detection
algorithms in igraph to find the clusters. Here, we illustrate
two of these methods; the walktrap algorithm (Pons and Latapy 2005) and the Louvain
method (Blondel et al. 2008). The cluster assignments are included in the
membership slot of the communities object returned by the community
detection.
clust_k10_walktrap <- igraph::cluster_walktrap(graph_k10)$membership
clust_k10_louvain <- igraph::cluster_louvain(graph_k10)$membership
There are several ways in which we can explore the clustering results further. For example, we can look at the number of inferred communities, and the number of cells assigned to each of them:
table(clust_k10_walktrap)
# clust_k10_walktrap
# 1 2 3 4 5 6 7 8 9 10 11 12
# 158 175 198 819 159 29 345 13 268 7 351 115
table(clust_k10_louvain)
# clust_k10_louvain
# 1 2 3 4 5 6 7 8
# 343 343 568 483 160 157 553 30
We can also compare the assignments between the two community detection methods, both numerically and graphically. A common metric for summarizing the agreement between two partitionings of the same set of cells is the adjusted Rand index (Rand 1971; Hubert and Arabie 1985) - the closer to 1 this value is, the more similar are the partitionings.
table(clust_k10_walktrap, clust_k10_louvain)
# clust_k10_louvain
# clust_k10_walktrap 1 2 3 4 5 6 7 8
# 1 5 0 0 0 153 0 0 0
# 2 0 0 0 18 0 157 0 0
# 3 0 0 0 197 0 0 0 1
# 4 50 0 567 0 0 0 202 0
# 5 159 0 0 0 0 0 0 0
# 6 0 0 0 0 0 0 0 29
# 7 1 343 1 0 0 0 0 0
# 8 13 0 0 0 0 0 0 0
# 9 0 0 0 268 0 0 0 0
# 10 0 0 0 0 7 0 0 0
# 11 0 0 0 0 0 0 351 0
# 12 115 0 0 0 0 0 0 0
pheatmap::pheatmap(table(clust_k10_walktrap, clust_k10_louvain))
mclust::adjustedRandIndex(clust_k10_walktrap, clust_k10_louvain)
# [1] 0.6521567
Finally, we often want to overlay the cluster assignments in a reduced dimension
representation, or in the original graph. One way of achieving the former is to
add the cluster labels to the SingleCellExperiment object, and use the
plotReducedDim function from scater to visualize the data. The
latter can be achieved using functions from igraph.
## Add cluster assignments to the SingleCellExperiment object and visualize in
## tSNE representation
sce$cluster_walktrap_k10 <- factor(clust_k10_walktrap)
sce$cluster_louvain_k10 <- factor(clust_k10_louvain)
scater::plotReducedDim(sce, "TSNE", colour_by = "cluster_walktrap_k10")
scater::plotReducedDim(sce, "TSNE", colour_by = "cluster_louvain_k10")
Seurat also implements a graph-based clustering, by default using the Louvain community detection algorithm. Since Seurat does not use the SingleCellExperiment container, the first thing we need to do is to create a Seurat object for the downstream analysis. Next, we create the nearest neighbor graph, and find the communities. Note that Seurat allows the specification of the resolution parameter. This will (implicitly) determine the number of communities. Here, we specify a range of resolutions, which will generate a collection of clustering results. Finally, we can move the cluster labels back into the original SingleCellExperiment object, for further exploration (which can of course also be done using functions from Seurat).
so <- Seurat::as.Seurat(sce, counts = "counts", data = "logcounts")
so <- Seurat::FindNeighbors(so, reduction = "PCA", k.param = 20,
dim = seq_len(ncol(reducedDim(sce, "PCA"))))
so <- Seurat::FindClusters(so, random.seed = 123, verbose = FALSE,
resolution = c(0.05, 0.1, 0.2, 0.4, 0.6, 1.0))
library(dplyr)
stopifnot(all(rownames(so@meta.data) == rownames(colData(sce))))
colData(sce) <- cbind(colData(sce),
so@meta.data %>% dplyr::select(contains("snn_res")))
First, a cautionary note: some care should be taken when interpreting the p-values from any statistical test applied in this context, since the testing is performed on the same data that is used to extract the clusters in the first place. Thus, almost by construction, there will be some genes that are differentially expressed between the different clusters.
The t-test is a natural choice for comparing observed expression levels in two groups (e.g., clusters). It has been shown to be competitive also in terms of performance on various types of scRNA-seq data (Soneson and Robinson 2018).
The scran package contains a function named pairwiseTTests,
which will, as the name suggests, perform a t-test between each pair of
clusters. The input is a matrix of normalized, log-transformed expression
values, and a vector of cluster labels. The output of this function call is a
list with two elements: statistics and pairs. Each element of statistics
is a DataFrame giving the results of the applied test for a given pair of
clusters (the corresponding pair is obtained from the pairs object). The
direction argument specifies whether we are interested in genes regulated in
any direction, or only up- or down-regulated genes, respectively.
pwtt <- scran::pairwiseTTests(
x = logcounts(sce), groups = sce$cluster_louvain_k10,
direction = "up"
)
names(pwtt)
# [1] "statistics" "pairs"
length(pwtt$statistics) ## number of pairs
# [1] 56
head(pwtt$statistics[[1]]) ## results from first pairwise test
# DataFrame with 6 rows and 3 columns
# logFC p.value FDR
# <numeric> <numeric> <numeric>
# MIR1302-10 0.00000000 0.500000 0.621894
# FAM138A 0.00000000 0.500000 0.621894
# OR4F5 0.00000000 0.500000 0.621894
# RP11-34P13.7 0.00000000 0.500000 0.621894
# RP11-34P13.8 0.00000000 0.500000 0.621894
# AL627309.1 0.00343217 0.159009 0.621894
head(pwtt$pairs) ## clusters compared in each pair
# DataFrame with 6 rows and 2 columns
# first second
# <character> <character>
# 1 1 2
# 2 1 3
# 3 1 4
# 4 1 5
# 5 1 6
# 6 1 7
While the pairwiseTTests function (and the similar pairwiseWilcox function
for the Wilcoxon test) provides a very convenient and efficient way of
performing all pairwise comparisons, in practice we often want to summarize or
combine the results across several of these comparisons. For example,
we may be interested in finding genes that are
upregulated in a specific cluster compared to each of the other clusters, or
compared to at least one of them. The function combineMarkers from
scran was written for this purpose, and allows the user to combine
the list of pairwise results in several ways. For example, in order to test, for
each cluster, whether each gene is significantly upregulated with respect to
all other clusters, we can do:
cbm_all <- scran::combineMarkers(
de.lists = pwtt$statistics, pairs = pwtt$pairs,
pval.type = "all"
)
The result of this function call is a list, containing one DataFrame for each original cluster. This DataFrame contains, in addition to the logFCs compared to each of the other clusters, a nominal and an adjusted p-value testing the hypothesis that the gene is not DE in all the contrasts involving the cluster of interest. Thus, the top-ranked markers for a given cluster can be seen as “specific” marker genes for that cluster.
It is often helpful from an interpretation point of view to explore the detected marker genes visually. scater contains many useful functions for creating such static plots, and other packages like iSEE can be used for interactive exploration. Here, we illustrate how to show the expression of marker genes across cells in the various clusters, as well as on top of a reduced dimension representation. We also make a heatmap showing the expression levels of the top two marker genes for each cluster.
head(cbm_all[["2"]])
# DataFrame with 6 rows and 10 columns
# p.value FDR summary.logFC logFC.1 logFC.3 logFC.4
# <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# MS4A1 8.73518e-107 2.85972e-102 1.708085 1.665541 1.672384 1.668763
# LINC00926 1.83405e-50 3.00215e-46 0.983469 0.984509 0.983000 0.984770
# CD79B 7.16696e-44 7.82106e-40 2.108759 2.107895 2.103941 2.153856
# VPREB3 1.24255e-39 1.01696e-35 0.781242 0.791596 0.789674 0.788134
# TCL1A 1.21764e-38 7.97263e-35 1.441181 1.469250 1.460292 1.473899
# CD79A 1.15440e-31 6.29877e-28 2.256432 2.364448 2.348915 2.369224
# logFC.5 logFC.6 logFC.7 logFC.8
# <numeric> <numeric> <numeric> <numeric>
# MS4A1 1.672373 1.666617 1.674825 1.708085
# LINC00926 0.990760 0.983469 0.985680 0.995798
# CD79B 2.005679 1.573217 2.152635 2.108759
# VPREB3 0.799893 0.792279 0.795387 0.781242
# TCL1A 1.469384 1.457930 1.471754 1.441181
# CD79A 2.393955 2.369584 2.361339 2.256432
scater::plotExpression(sce, features = c("CD79A", "MS4A1"),
x = "cluster_louvain_k10")
cowplot::plot_grid(scater::plotTSNE(sce, colour_by = "CD79A"),
scater::plotTSNE(sce, colour_by = "MS4A1"))
scater::plotHeatmap(sce, features = unique(unlist(lapply(cbm_all, function(w) rownames(w)[1:2]))),
columns = colnames(sce)[order(sce$cluster_louvain_k10)],
colour_columns_by = "cluster_louvain_k10", cluster_cols = FALSE,
show_colnames = FALSE, cluster_rows = FALSE)
It is worth pointing out that in practice, we do not need to go through the two
individual steps above (first doing all pairwise tests, and then combining the
results). The findMarkers function from scran will do this for
us, given a specification of how we wish to combine the results across
contrasts. To illustrate this, we instead search for genes that are upregulated
in a cluster compared to any of the other clusters (i.e., testing the null
hypothesis that the gene is not DE in any of the contrasts).
cbm_any <- scran::findMarkers(
sce, groups = sce$cluster_louvain_k10,
pval.type = "any", direction = "up"
)
Again, the output of the above command is a list of DataFrames, one for each
cluster. Each DataFrame contains the logFC with respect to each other cluster,
and a nominal and an adjusted p-value. There is also a column named Top, which
gives the minimum rank for the gene across all pairwise comparisons. For
example, if Top = 1, the gene is the top-ranked one in at least one comparison
of the cluster of interest to the other clusters.
To illustrate the difference between the two types of tests, we plot the p-values obtained when comparing cluster 2 to the other clusters with the two approaches.
## p-values for the pval.type = "all" comparison
df_all <- as.data.frame(cbm_all[["2"]]) %>%
tibble::rownames_to_column("gene") %>%
dplyr::select(gene, p.value) %>%
setNames(c("gene", "p.value.all"))
## p-values for the pval.type = "any" comparison
df_any <- as.data.frame(cbm_any[["2"]]) %>%
tibble::rownames_to_column("gene") %>%
dplyr::select(gene, p.value) %>%
setNames(c("gene", "p.value.any"))
## Merge and plot
df <- dplyr::inner_join(df_all, df_any, by = "gene")
ggplot(df, aes(x = p.value.all, y = p.value.any)) +
geom_point() + scale_x_log10() + scale_y_log10() +
theme_bw()
We see that while there are several genes that are strongly significant in both
types of analyses, there are other genes that are only strongly significant with
pval.type = "any". We look at one of the genes that are among the top-ranked
ones in both types of comparisons, and one of the genes that is top-ranked only
in the "any" approach.
genes <- c("CD79A", "RPS16")
subset(df, gene %in% genes)
# gene p.value.all p.value.any
# 6 CD79A 1.154396e-31 3.558342e-147
# 32211 RPS16 1.000000e+00 2.446925e-42
scater::plotExpression(sce, features = genes, x = "cluster_louvain_k10")
Note the difference between a gene that is upregulated in cluster 2 compared to all other clusters, and one that is upregulated to at least one other cluster.
While pval.type = "all" will, as just illustrated, allow us to detect marker
genes that are specific to a given cluster, there are important pitfalls to be
aware of. In order to illustrate one of these, we artificially split the cells in
cluster 2 into two clusters (call one of the groups “2b”), and redo the test to
find genes that are upregulated in cluster 2 compared to all other clusters.
tmp <- as.character(sce$cluster_louvain_k10)
set.seed(123)
tmp[sample(which(tmp == "2"), sum(tmp == "2")/2)] <- "2b"
sce$cluster_louvain_k10_mod <- factor(tmp)
cbm_all_mod <- scran::findMarkers(
sce, groups = sce$cluster_louvain_k10_mod,
pval.type = "all"
)
cbm_all_mod[["2"]]["CD79A", ]
# DataFrame with 1 row and 11 columns
# p.value FDR summary.logFC logFC.1 logFC.2b logFC.3 logFC.4
# <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
# CD79A 0.514575 1 0.0710252 2.39986 0.0710252 2.38432 2.40463
# logFC.5 logFC.6 logFC.7 logFC.8
# <numeric> <numeric> <numeric> <numeric>
# CD79A 2.42936 2.40499 2.39675 2.29184
scater::plotExpression(sce, features = "CD79A", x = "cluster_louvain_k10_mod")
Note that now, the gene that was strongly upregulated in cluster 2 compared to all other clusters before, is no longer so, since it is expressed also in cluster 2b. This is important to consider in cases where the data may be overclustered, and thus there may be several clusters corresponding to the same underlying cell type. This could also happen, for example, if clusters 2 and 2b were different types of T-cells - no general T-cell markers would be upregulated in any of these clusters compared to all the other clusters.
Note that the findMarkers function provides an interface to several
different types of tests.
In addition to the default t-test, it is possible to perform Wilcoxon tests or
binomial tests (testing for differences in the presence/absence pattern of genes
between clusters) using the same interface, by specifying the test argument.
In the presence of strong batch effects (e.g., when cells come from different
studies or are prepared in multiple batches), these should be accounted for in
the differential expression analysis. One way of doing this is to use the
block argument of findMarkers, which effectively performs the cluster
comparisons in each batch, and subsequently combines the results into a single
p-value.
Cell type prediction methods attempt to assign a cell type label to each cell in a data set based on its similarity to cells of the same type in a labeled reference data set. Here, we will illustrate how to perform such an analysis using the SingleR package.
## Load reference dataset
library(celldex)
library(SingleR)
ref <- celldex::MonacoImmuneData()
## Predict labels
pred <- SingleR::SingleR(test = sce, ref = ref, labels = ref$label.main)
table(pred$labels)
#
# B cells CD4+ T cells CD8+ T cells Dendritic cells Monocytes
# 349 890 293 41 630
# NK cells Progenitors T cells
# 160 11 263
## Add assigned labels to original object
sce$singler_labels <- pred$labels
sce$singler_pruned_labels <- pred$pruned.labels
# sce$singler_diffscore <- pred$tuning.scores$first - pred$tuning.scores$second
sce
# class: SingleCellExperiment
# dim: 32738 2637
# metadata(1): hvgs
# assays(2): counts logcounts
# rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
# rowData names(5): ENSEMBL_ID Symbol_TENx Symbol mean detected
# colnames(2637): AAACATACAACCAC AAACATTGAGCTAC ... TTTGCATGAGAGGC
# TTTGCATGCCTCAC
# colData names(30): Sample Barcode ... singler_labels
# singler_pruned_labels
# reducedDimNames(2): PCA TSNE
# mainExpName: NULL
# altExpNames(0):
## Plot association with cluster labels
library(pheatmap)
table(sce$singler_pruned_labels, sce$cluster_walktrap_k10, useNA = "ifany")
#
# 1 2 3 4 5 6 7 8 9 10 11 12
# B cells 0 0 1 0 0 0 340 5 1 0 0 0
# CD4+ T cells 0 0 0 656 1 0 0 0 0 0 227 1
# CD8+ T cells 1 0 0 84 80 0 1 7 0 0 107 12
# Dendritic cells 0 0 5 0 0 27 2 1 6 0 0 0
# Monocytes 0 175 192 0 0 2 0 0 261 0 0 0
# NK cells 150 0 0 0 9 0 0 0 0 0 0 0
# Progenitors 0 0 0 1 0 0 0 0 0 7 3 0
# T cells 7 0 0 72 68 0 1 0 0 0 13 102
# <NA> 0 0 0 6 1 0 1 0 0 0 1 0
pheatmap(table(sce$singler_pruned_labels, sce$cluster_walktrap_k10,
useNA = "ifany"))
plotReducedDim(sce, "TSNE", colour_by = "singler_pruned_labels")
In this tutorial we have used Bioconductor packages for the analysis, and represented the data in a SingleCellExperiment container. Just as for bulk RNA-seq, there are also other containers that are often used for single-cell data. The most common alternative is the Seurat object, which is the basis for the workflow provided by the Seurat package. This package provides to a large extent similar capabilities as the Bioconductor packages we have seen in this lecture, and can be used as an alternative. The webpage contains a collection of tutorials (including one for the same data set that we studied here).
iSEE my dataiSEE can offer an extreme variety of plots and visual representations of single cell datasets, especially linking the panels together to perform fine-grained analyses otherwise not possible with “simple static plots”.
We will refer to the material in https://isee.github.io/iSEEWorkshopEuroBioc2020/ for more examples (self-contained “recipes” that cover potential typical use cases).
An example of what iSEE can do with larger collections of data is available here:
https://github.com/iSEE/iSEE_instances/
It can be also used as a powerful portal to tell a story about your data, like we did in http://shiny.imbei.uni-mainz.de:3838/covid_IT
sessionInfo()
# R version 4.4.0 (2024-04-24)
# Platform: x86_64-apple-darwin20
# Running under: macOS Monterey 12.7.1
#
# Matrix products: default
# BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# time zone: Europe/Berlin
# tzcode source: internal
#
# attached base packages:
# [1] stats4 stats graphics grDevices utils datasets methods
# [8] base
#
# other attached packages:
# [1] SingleR_2.6.0 celldex_1.14.0
# [3] scran_1.32.0 scater_1.32.0
# [5] scuttle_1.14.0 TENxPBMCData_1.22.0
# [7] HDF5Array_1.32.0 rhdf5_2.48.0
# [9] DelayedArray_0.30.1 SparseArray_1.4.8
# [11] S4Arrays_1.4.1 abind_1.4-5
# [13] Matrix_1.7-0 lubridate_1.9.3
# [15] forcats_1.0.0 stringr_1.5.1
# [17] dplyr_1.1.4 purrr_1.0.2
# [19] readr_2.1.5 tidyr_1.3.1
# [21] tibble_3.2.1 tidyverse_2.0.0
# [23] ggthemes_5.1.0 gapminder_1.0.0
# [25] ggplot2_3.5.1 MASS_7.3-61
# [27] clusterProfiler_4.12.0 topGO_2.56.0
# [29] SparseM_1.83 GO.db_3.19.1
# [31] graph_1.82.0 GeneTonic_2.8.0
# [33] iSEEu_1.16.0 iSEEhex_1.6.0
# [35] iSEE_2.16.0 SingleCellExperiment_1.26.0
# [37] pheatmap_1.0.12 apeglm_1.26.1
# [39] pcaExplorer_2.30.0 bigmemory_4.6.4
# [41] edgeR_4.2.0 limma_3.60.3
# [43] ExploreModelMatrix_1.16.0 GenomicFeatures_1.57.0
# [45] org.Hs.eg.db_3.19.1 AnnotationDbi_1.66.0
# [47] DESeq2_1.44.0 SummarizedExperiment_1.34.0
# [49] Biobase_2.64.0 MatrixGenerics_1.16.0
# [51] matrixStats_1.3.0 GenomicRanges_1.56.1
# [53] GenomeInfoDb_1.40.1 IRanges_2.38.0
# [55] S4Vectors_0.42.0 BiocGenerics_0.50.0
# [57] tximeta_1.22.1 macrophage_1.20.0
# [59] rmarkdown_2.27 knitr_1.47
# [61] BiocStyle_2.32.1
#
# loaded via a namespace (and not attached):
# [1] igraph_2.0.3 ica_1.0-3
# [3] plotly_4.10.4 ca_0.71.1
# [5] listviewer_4.0.0 zlibbioc_1.50.0
# [7] tidyselect_1.2.1 bit_4.0.5
# [9] doParallel_1.0.17 clue_0.3-65
# [11] lattice_0.22-6 rjson_0.2.21
# [13] blob_1.2.4 rngtools_1.5.2
# [15] parallel_4.4.0 png_0.1-8
# [17] cli_3.6.2 ggplotify_0.1.2
# [19] registry_0.5-1 GOstats_2.70.0
# [21] ProtGenerics_1.36.0 rintrojs_0.3.4
# [23] goftest_1.2-3 BiocIO_1.14.0
# [25] bs4Dash_2.3.3 bluster_1.14.0
# [27] BiocNeighbors_1.22.0 uwot_0.2.2
# [29] tximport_1.32.0 dendextend_1.17.1
# [31] shadowtext_0.1.3 curl_5.2.1
# [33] mime_0.12 evaluate_0.24.0
# [35] tidytree_0.4.6 leiden_0.4.3.1
# [37] ComplexHeatmap_2.20.0 stringi_1.8.4
# [39] XML_3.99-0.16.1 httpuv_1.6.15
# [41] magrittr_2.0.3 rappdirs_0.3.3
# [43] splines_4.4.0 mclust_6.1.1
# [45] ggraph_2.2.1 webshot_0.5.5
# [47] DT_0.33 sctransform_0.4.1
# [49] ggbeeswarm_0.7.2 DBI_1.2.3
# [51] jquerylib_0.1.4 genefilter_1.86.0
# [53] withr_3.0.0 emdbook_1.3.13
# [55] lmtest_0.9-40 enrichplot_1.24.0
# [57] RBGL_1.80.0 GSEABase_1.66.0
# [59] bdsmatrix_1.3-7 tidygraph_1.3.1
# [61] formatR_1.14 colourpicker_1.3.0
# [63] rtracklayer_1.64.0 BiocManager_1.30.23
# [65] htmlwidgets_1.6.4 fs_1.6.4
# [67] biomaRt_2.60.0 ggrepel_0.9.5
# [69] labeling_0.4.3 shinycssloaders_1.0.0
# [71] reticulate_1.38.0 annotate_1.82.0
# [73] hexbin_1.28.3 zoo_1.8-12
# [75] XVector_0.44.0 UCSC.utils_1.0.0
# [77] timechange_0.3.0 foreach_1.5.2
# [79] fansi_1.0.6 patchwork_1.2.0
# [81] visNetwork_2.1.2 grid_4.4.0
# [83] data.table_1.15.4 ggtree_3.12.0
# [85] seriation_1.5.5 RSpectra_0.16-1
# [87] irlba_2.3.5.1 alabaster.schemas_1.4.0
# [89] fastDummies_1.7.3 gridGraphics_0.5-1
# [91] lazyeval_0.2.2 yaml_2.3.8
# [93] survival_3.7-0 scattermore_1.2
# [95] BiocVersion_3.19.1 crayon_1.5.3
# [97] RcppAnnoy_0.0.22 progressr_0.14.0
# [99] RColorBrewer_1.1-3 tweenr_2.0.3
# [101] later_1.3.2 Rgraphviz_2.48.0
# [103] ggridges_0.5.6 codetools_0.2-20
# [105] base64enc_0.1-3 GlobalOptions_0.1.2
# [107] Seurat_5.1.0 KEGGREST_1.44.1
# [109] bbmle_1.0.25.1 Rtsne_0.17
# [111] shape_1.4.6.1 icons_0.2.0
# [113] Rsamtools_2.20.0 filelock_1.0.3
# [115] shinyBS_0.61.1 pkgconfig_2.0.3
# [117] xml2_1.3.6 GenomicAlignments_1.40.0
# [119] aplot_0.2.3 alabaster.base_1.4.1
# [121] spatstat.sparse_3.1-0 ape_5.8
# [123] viridisLite_0.4.2 gridBase_0.4-7
# [125] xtable_1.8-4 Category_2.70.0
# [127] highr_0.11 plyr_1.8.9
# [129] httr_1.4.7 globals_0.16.3
# [131] tools_4.4.0 SeuratObject_5.0.2
# [133] beeswarm_0.4.0 AnnotationForge_1.46.0
# [135] nlme_3.1-165 xaringan_0.30
# [137] HDO.db_0.99.1 dbplyr_2.5.0
# [139] ExperimentHub_2.12.0 shinyjs_2.1.0
# [141] crosstalk_1.2.1 ComplexUpset_1.3.3
# [143] assertthat_0.2.1 digest_0.6.35
# [145] numDeriv_2016.8-1.1 bookdown_0.39
# [147] farver_2.1.2 tzdb_0.4.0
# [149] AnnotationFilter_1.28.0 reshape2_1.4.4
# [151] yulab.utils_0.1.4 viridis_0.6.5
# [153] glue_1.7.0 cachem_1.1.0
# [155] BiocFileCache_2.12.0 polyclip_1.10-6
# [157] generics_0.1.3 Biostrings_2.72.1
# [159] dynamicTreeCut_1.63-1 mvtnorm_1.2-5
# [161] parallelly_1.37.1 txdbmaker_1.0.0
# [163] statmod_1.5.0 RcppHNSW_0.6.0
# [165] ScaledMatrix_1.12.0 pbapply_1.7-2
# [167] httr2_1.0.1 spam_2.10-0
# [169] backbone_2.1.4 vroom_1.6.5
# [171] gson_0.1.0 dqrng_0.4.1
# [173] utf8_1.2.4 graphlayouts_1.1.1
# [175] alabaster.se_1.4.1 fontawesome_0.5.2
# [177] gridExtra_2.3 shiny_1.8.1.1
# [179] GenomeInfoDbData_1.2.12 rhdf5filters_1.16.0
# [181] RCurl_1.98-1.14 memoise_2.0.1
# [183] scales_1.3.0 threejs_0.3.3
# [185] gypsum_1.0.1 future_1.33.2
# [187] RANN_2.6.1 spatstat.data_3.1-2
# [189] bigmemory.sri_0.1.8 rstudioapi_0.16.0
# [191] cluster_2.1.6 spatstat.utils_3.0-5
# [193] fitdistrplus_1.1-11 hms_1.1.3
# [195] munsell_0.5.1 cowplot_1.1.3
# [197] colorspace_2.1-0 rlang_1.1.4
# [199] DelayedMatrixStats_1.26.0 sparseMatrixStats_1.16.0
# [201] shinyWidgets_0.8.6 dotCall64_1.1-1
# [203] shinydashboard_0.7.2 ggforce_0.4.2
# [205] circlize_0.4.16 mgcv_1.9-1
# [207] xfun_0.45 alabaster.matrix_1.4.1
# [209] heatmaply_1.5.0 coda_0.19-4.1
# [211] iterators_1.0.14 GOSemSim_2.30.0
# [213] treeio_1.28.0 Rhdf5lib_1.26.0
# [215] bitops_1.0-7 promises_1.3.0
# [217] scatterpie_0.2.3 RSQLite_2.3.7
# [219] qvalue_2.36.0 TSP_1.2-4
# [221] fgsea_1.30.0 compiler_4.4.0
# [223] alabaster.ranges_1.4.1 prettyunits_1.2.0
# [225] beachmat_2.20.0 listenv_0.9.1
# [227] Rcpp_1.0.12 tippy_0.1.0
# [229] tensor_1.5 AnnotationHub_3.12.0
# [231] BiocSingular_1.20.0 progress_1.2.3
# [233] uuid_1.2-0 BiocParallel_1.38.0
# [235] spatstat.random_3.2-3 archive_1.1.8
# [237] R6_2.5.1 fastmap_1.2.0
# [239] fastmatch_1.1-4 ROCR_1.0-11
# [241] vipor_0.4.7 ensembldb_2.28.0
# [243] rsvd_1.0.5 gtable_0.3.5
# [245] shinyAce_0.4.2 KernSmooth_2.23-24
# [247] miniUI_0.1.1.1 deldir_2.0-4
# [249] htmltools_0.5.8.1 bit64_4.0.5
# [251] spatstat.explore_3.2-7 lifecycle_1.0.4
# [253] restfulr_0.0.15 sass_0.4.9
# [255] vctrs_0.6.5 rsconnect_1.3.1
# [257] spatstat.geom_3.2-9 DOSE_3.30.1
# [259] NMF_0.27 ggfun_0.1.5
# [261] emo_0.0.0.9000 sp_2.1-4
# [263] future.apply_1.11.2 bslib_0.7.0
# [265] pillar_1.9.0 metapod_1.12.0
# [267] locfit_1.5-9.9 jsonlite_1.8.8
# [269] expm_0.999-9 GetoptLong_1.0.5